Report

Data Science Job Salary Analysis

Content

Introduction

In the era of big data and technological advancement, the role of data science has become integral to driving innovation, efficiency, and strategic decision-making across diverse industries. As organizations increasingly rely on data-driven insights, the demand for skilled data scientists has surged, leading to a competitive job market where salary structures play a crucial role. This report undertakes a comprehensive exploration into the factors influencing data science salaries, aiming to uncover patterns, trends, and geographical variations that define compensation packages in this dynamic field.

Motivation:

The motivation behind this in-depth analysis lies in addressing the growing curiosity and necessity surrounding data science salaries. For aspiring data scientists, understanding the key determinants of compensation is essential in shaping career trajectories and making informed choices regarding skill development. Simultaneously, employers and industry stakeholders seek insights into the factors that attract and retain top-tier data science talent in order to remain competitive and innovative.

The field of data science is not static; it evolves with technological advancements, industry demands, and methodological innovations. Consequently, the motivation for this report is to offer a nuanced perspective on the salary landscape, moving beyond a superficial examination to delve into the specific factors that contribute to earning differentials within the profession. *Additionally, a particular emphasis will be placed on exploring which states and cities within the United States, as well as countries globally, offer the highest-paying data science roles*. By doing so, this report aims to provide a comprehensive understanding of the regional dynamics shaping data science salaries, offering valuable insights for both professionals and employers navigating the dynamic landscape of data science compensation.

Building on a preliminary attempt to predict salaries from job descriptions, this report goes a step further and examines how various factors influence compensation in data science. This initial exploration sets the stage for a more comprehensive understanding of salary determinants and aims to offer useful insights for both professionals and employers.

Data Sources:

Data Source 1:

Data URL: https://www.kaggle.com/datasets/thedevastator/jobs-dataset-from-glassdoor/
Data Type: CSV format
Total data size: 741
Dataset Year: 2017

About: This dataset encapsulates job postings sourced from Glassdoor.com, spanning the period of 2017-2018 and focusing on positions within the United States. The dataset is enriched with a diverse set of features providing comprehensive insights into each job listing.

Key attributes include:

Job Title: The specific designation or role associated with the job posting.
Salary Estimate: An indication of the expected salary for the corresponding position.
Job Description: A detailed overview of the responsibilities and requirements associated with the job.
Rating: The rating assigned to the company, reflecting its overall reputation.
Company Name: The name of the hiring company.
Location: The geographical location of the job.
Headquarters: The location of the company's headquarters.
Size: The size of the company in terms of the number of employees.
Founded: The year the company was established.
Type of Ownership: The ownership structure of the company.
Industry: The industry to which the company belongs.
Sector: The sector in which the company operates.
Revenue: Information about the company's revenue.
Competitors: Identifies competitors in the industry.
Hourly: Indicates if the salary is offered on an hourly basis.
Employer Provided: Specifies whether the employer provides additional benefits.
Min Salary, Max Salary, Avg Salary: Different aspects of the salary information.
Company Text: A textual representation of the company's name.
Job State: The state in which the job is located.
Same State: Indicates if the job is in the same state as the company's headquarters.
Age: The age of the company, calculated from the founding year.
Python, R, Spark, AWS, Excel: Binary indicators showcasing the technologies or skills associated with the job.
This comprehensive dataset offers a wealth of information for analysis and exploration, making it valuable for understanding trends and patterns in the job market within the specified timeframe and region.
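
The derived columns listed above (the company age and the binary skill indicators) could be produced along the following lines. This is only a sketch of one plausible approach, not the dataset author's actual preprocessing: the function names, the whole-word matching rule, and the use of -1 for an unknown founding year are all assumptions made for illustration.

```python
import re

# The five skills the dataset exposes as binary indicators.
SKILLS = ("python", "r", "spark", "aws", "excel")


def skill_flags(description: str) -> dict:
    """Return a 0/1 flag per skill, based on whole-word matches (assumed rule)."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return {skill: int(skill in words) for skill in SKILLS}


def company_age(founded: int, reference_year: int) -> int:
    """Company age in years; a non-positive founding year is treated as unknown (-1)."""
    return reference_year - founded if founded > 0 else -1


flags = skill_flags("Experience with Python, Spark and Excel required.")
age = company_age(1976, 2018)
print(flags, age)
```

Whole-word matching is a simplification: a single-letter skill such as R is easy to over-match (for example in "R&D"), so a real pipeline would likely need a stricter pattern for it.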

Data Source 2:

Data URL: https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023
Data Type: CSV format
Total data size: 3755
Dataset Year: 2020-2023

About: The analysis was conducted on a dataset containing information about data science roles from 2020 to 2023. The dataset comprises the following variables:

  1. work_year: The year in which the salary was paid.
  2. experience_level: The level of experience in the job during the year, categorized as:
    • EN > Entry-level / Junior
    • MI > Mid-level / Intermediate
    • SE > Senior-level / Expert
    • EX > Executive-level / Director
  3. employment_type: The type of employment for the role, identified as:
    • PT > Part-time
    • FT > Full-time
    • CT > Contract
    • FL > Freelance
  4. job_title: The specific role held during the year.
  5. salary: The total gross salary amount paid.
  6. salary_currency: The currency of the salary paid, represented as an ISO 4217 currency code.
  7. salary_in_usd: The salary converted to USD.
  8. employee_residence: The primary country of residence for the employee during the work year, denoted by an ISO 3166 country code.
  9. remote_ratio: The overall proportion of work conducted remotely.
  10. company_location: The country where the employer's main office or contracting branch is located.
  11. company_size: The median number of individuals employed by the company during the year.
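
The two-letter codes listed above are easier to read in charts once they are mapped to their labels. A minimal sketch of such a mapping follows; the dictionary and function names are hypothetical, and unknown codes are simply passed through unchanged.

```python
# Code-to-label mappings taken from the variable descriptions above.
EXPERIENCE_LEVELS = {
    "EN": "Entry-level / Junior",
    "MI": "Mid-level / Intermediate",
    "SE": "Senior-level / Expert",
    "EX": "Executive-level / Director",
}
EMPLOYMENT_TYPES = {
    "PT": "Part-time",
    "FT": "Full-time",
    "CT": "Contract",
    "FL": "Freelance",
}


def decode_record(record: dict) -> dict:
    """Return a copy of the record with coded fields replaced by readable labels."""
    decoded = dict(record)
    level = record["experience_level"]
    kind = record["employment_type"]
    decoded["experience_level"] = EXPERIENCE_LEVELS.get(level, level)
    decoded["employment_type"] = EMPLOYMENT_TYPES.get(kind, kind)
    return decoded


example = decode_record(
    {"job_title": "Data Scientist", "experience_level": "SE", "employment_type": "FT"}
)
print(example)
```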

Method

Installation of dependencies

Begin by installing the necessary dependencies. It's crucial to note that a specific version of SQLAlchemy is required due to compatibility issues with pandas, as SQLAlchemy 2.0 is currently incompatible. Additionally, nbformat is indispensable for leveraging the "notebook" formatter, as other formatters may not be rendered correctly to HTML. Ensure the installation of these dependencies to facilitate seamless functionality and proper rendering within the notebook environment.

Importing all libraries

Loading Data

clean_salary.sqlite and global_salary.sqlite are obtained from Data Source 1 and Data Source 2, respectively.
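
As a rough illustration of the loading step, a SQLite table can be read into a pandas DataFrame as shown below. The table name `salaries` and its columns are placeholders, since the actual schema of the two files is not shown here, and the sketch uses the standard-library `sqlite3` driver with an in-memory database instead of the real files.

```python
import sqlite3

import pandas as pd


def load_table(conn: sqlite3.Connection, table: str) -> pd.DataFrame:
    """Read one whole table from an open SQLite connection into a DataFrame."""
    return pd.read_sql_query(f'SELECT * FROM "{table}"', conn)


# In-memory stand-in for clean_salary.sqlite (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (job_title TEXT, avg_salary REAL)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [("Data Scientist", 120.0), ("Data Engineer", 105.0)],
)
us_df = load_table(conn, "salaries")
conn.close()
print(us_df)
```

For the real files, the same function would be called with `sqlite3.connect("clean_salary.sqlite")` and whatever table name the pipeline wrote.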

Datasource 1 Analysis : United States

Answering broad questions

Preview of the dataset:

1. What are the top 10 job titles in the United States?

Inference 1: The analysis reveals a substantial prevalence of job listings for Data Scientist and Data Engineer roles, suggesting a robust demand for these professions in the United States. This observation implies a noteworthy abundance of opportunities in comparison to other data-related jobs, emphasizing the significance of these roles in the American job market.
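
A ranking like this one is typically a frequency count over the job-title column. The sketch below shows the idea on toy data; the column name `job_title` and the helper name are assumptions, and the toy rows are not taken from the dataset.

```python
import pandas as pd


def top_titles(df: pd.DataFrame, column: str = "job_title", n: int = 10) -> pd.Series:
    """Count listings per title and keep the n most frequent."""
    return df[column].value_counts().head(n)


# Toy data for illustration only.
demo = pd.DataFrame(
    {"job_title": ["Data Scientist"] * 3 + ["Data Engineer"] * 2 + ["Data Analyst"]}
)
top = top_titles(demo, n=2)
print(top)
```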

2. What are the different types of ownership and their counts in the United States?

Inference 2: The fact that most companies are privately owned suggests there are plenty of opportunities in the private sector. For entrepreneurs, it's worth considering ventures in data technology, which is booming and gaining importance in the American business scene.

3. What are the top five industries with the most job listings in the United States?

Inference 3: Based on the analysis, the Biotech & Pharmaceuticals industry emerges as the sector with the highest number of job listings in data science. Notably, within this industry, companies such as Genentech and Pfizer stand out for offering comparatively higher salaries, reflecting a distinct salary trend among top players in the field.

4. How is the revenue distributed among the companies?

Inference 4: The analysis of revenue distribution showcases a diverse landscape. A considerable number of companies fall under the "Unknown / Non-Applicable" category, indicating a lack of available revenue information for these entities. Among companies with disclosed revenue figures, a notable proportion belongs to the 10+ billion (USD) bracket, underscoring the presence of major corporations in the dataset. Additionally, there is a significant representation in the mid-range, with a considerable number of companies reporting revenues between 100 million to 2 billion (USD), highlighting a diverse mix of companies operating at various scales.

5. Which job titles have the highest and lowest salaries in the field of data?

Inference 5:

Based on the analysis, it is evident that within the data field, the highest salaries are earned by Data Scientists, constituting 16% of the roles, followed by Senior Data Scientists at 10%. Lead Data Scientists and Lead Data Engineers secure the next positions with 6% and 3%, respectively. The hierarchy indicates that Data Scientists generally command the highest salaries, trailed by Data Engineers and Data Managers in descending order.

The analysis reveals a disparity in salaries among Data Analyst roles, with Marketing Data Analysts and Senior Data Analysts earning considerably more than Research Scientists, Staff Scientists and Junior Analysts. This highlights distinct salary variations within the Data Analyst domain, emphasizing the impact of specialization and experience on compensation levels.

6. Which cities in the United States have the highest and lowest salaries in the field of data?

Inference 6: The analysis indicates that San Francisco accounts for approximately 21% of the top 100 highest-paying cities, showcasing its prominence in lucrative employment opportunities. Surprisingly, New York, Chicago, and Boston, which feature prominently among the highest-paying cities (about 7% of them), also appear among the low-salary cities, revealing regional disparities in salary distributions.

7. What are the technology requirements of high-salary companies, by company age?

Inference 7: Python emerges as the predominant technology adopted universally across companies, spanning a wide age range from 7 to 173 years. Excel exhibits pervasive usage across companies irrespective of their age. While Spark and AWS find application in select companies, notably absent is the utilization of R language. Consequently, mastering Python and Excel is deemed essential for proficiency in the field.

8. How does company age affect salary in the data science field?

Inference 8: The analysis indicates that companies in the field of data science, which pay high salaries, are predominantly those under 20 years of age. Organizations aged between 20 and 60 years offer moderate salary ranges. Notably, there is a discernible trend suggesting that newer companies or startups have a greater propensity to offer higher salaries compared to their older counterparts.
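
One way to check a pattern like this is to bucket companies into the age bands mentioned above and compare average salaries per band. The following is a sketch on toy numbers, not the notebook's actual code; the column names `age` and `avg_salary` and the exact band edges are assumptions.

```python
import pandas as pd


def salary_by_age_band(df: pd.DataFrame) -> pd.Series:
    """Mean salary per company-age band: [0, 20), [20, 60), [60, inf)."""
    bands = pd.cut(
        df["age"],
        bins=[0, 20, 60, float("inf")],
        labels=["under 20", "20 to 60", "over 60"],
        right=False,  # left-closed intervals, so age 20 falls in "20 to 60"
    )
    return df.groupby(bands, observed=True)["avg_salary"].mean()


# Toy data for illustration only (salaries in thousands of USD).
demo = pd.DataFrame(
    {"age": [5, 12, 35, 50, 100], "avg_salary": [150.0, 130.0, 110.0, 100.0, 90.0]}
)
by_band = salary_by_age_band(demo)
print(by_band)
```

Ages outside the bins, such as a negative placeholder for an unknown founding year, become missing values in `pd.cut` and drop out of the grouped result.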

Prediction of Salary based on the Job Description using Machine Learning Technique

This code aims to create a machine learning model for predicting job salaries based on job descriptions. It uses a pipeline with CountVectorizer for text representation and Linear Regression as the predictive model. The model is trained on this dataset, and after evaluation, it is saved to a file ('LinearRegression.pickle'). The trained model is then used to make predictions on a test set, and finally, it demonstrates the capability to predict the salary for a new job description provided as 'new_job_description'.
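
The pipeline described above can be sketched as follows. The toy descriptions and salary values are invented purely so the example runs on its own, and the model is serialized in memory with `pickle.dumps` where the notebook writes 'LinearRegression.pickle' to disk; the notebook's actual split ratio and vectorizer settings are not known.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data (salaries in thousands of USD), for illustration only.
descriptions = [
    "Senior data scientist building machine learning models in Python",
    "Junior data analyst preparing Excel reports",
    "Data engineer maintaining Spark pipelines on AWS",
    "Lead data scientist with deep learning and Python experience",
    "Data analyst creating dashboards and Excel summaries",
    "Machine learning engineer deploying models on AWS",
    "Senior data engineer designing Spark and SQL workflows",
    "Entry level analyst cleaning data in Excel",
]
salaries = [140.0, 60.0, 110.0, 150.0, 65.0, 130.0, 125.0, 55.0]

X_train, X_test, y_train, y_test = train_test_split(
    descriptions, salaries, test_size=0.25, random_state=0
)

# Bag-of-words counts feed directly into an ordinary least-squares regressor.
model = make_pipeline(CountVectorizer(), LinearRegression())
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)  # R-squared on the held-out descriptions

# Round-trip through pickle, mirroring the save-and-reload step.
restored = pickle.loads(pickle.dumps(model))

new_job_description = "Data scientist with Python and machine learning skills"
predicted = float(restored.predict([new_job_description])[0])
print(test_r2, predicted)
```

Keeping the vectorizer and the regressor in one pipeline means the saved object accepts raw text, so a new description can be scored without repeating any preprocessing by hand.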

Inference: The code `model.score(X_test, y_test)` calculates the R-squared (R2) score for the trained model on the provided test data. In this case, the obtained score is 0.6294. The R2 score is at most 1 and can be negative on test data when the model does worse than always predicting the mean; it represents the proportion of the variance in the dependent variable (y_test) that is predictable from the independent variables (X_test). A score of 0.6294 suggests that approximately 62.94% of the variability in the actual job salary values is captured by the model, indicating a moderate level of predictive performance.

The scatter plot `fig` visualizes the relationship between the salaries predicted by the model (`y_pred`) and the actual salaries (`y_test`). Each point on the plot represents a data instance, with the predicted salary on the x-axis and the actual salary on the y-axis. This plot helps assess how well the model's predictions align with the true salary values. Points lying along the diagonal indicate accurate predictions, while deviations from the diagonal show discrepancies between predicted and actual values.
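
To make the score concrete, R2 can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a small illustrative re-implementation (the function name is hypothetical), using made-up numbers; it also shows that the value drops below zero when predictions are worse than simply predicting the mean.

```python
def r2_score_manual(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [100.0, 120.0, 140.0]
perfect = r2_score_manual(y_true, y_true)                  # predictions match exactly
baseline = r2_score_manual(y_true, [120.0, 120.0, 120.0])  # always predict the mean
worse = r2_score_manual(y_true, [140.0, 120.0, 100.0])     # predictions reversed
print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```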

DataSource 2 Analysis: Global

Answering broad questions

Preview of the dataset

1. What are the top 10 job titles globally?

Inference 1: Among the 20 most common data jobs globally, data engineer emerges as the predominant role, followed in order by data scientist, data analyst, and machine learning engineer. This ranking sheds light on the key roles driving the data landscape and emphasizes the importance of data engineering in contemporary data-centric professions.

2. What is the experience level distribution?

Inference 2: The distribution shows a distinct pattern in experience levels: experts are the most prevalent, followed by intermediate professionals and then junior roles.

3. Which job titles have the highest salary?

Inference 3: The examination clearly indicates that Data Science Lead commands the highest global salaries within the data field, showcasing a substantial earnings gap compared to roles such as Data Analytics Lead, Data Engineer, Machine Learning Engineer, and Data Science Manager, who all earn relatively similar moderate salaries. This insight underscores the distinct salary hierarchies across various key positions in the data domain.

4. Which are the top 10 countries with the highest salaries?

Global Map Visualization

Inference 4: The findings suggest that, on a global scale, Israel stands out for offering significantly higher salaries, denominated in USD, compared to other countries. This observation underscores a notable disparity in compensation rates between Israel and the rest of the world.

5. Which countries' residents are hired most by companies based abroad?

Inference 5: Evidently, a substantial majority of Indian employees find employment opportunities with companies based abroad, indicating a significant trend in cross-border hiring. This notable pattern emphasizes the strong global demand for Indian professionals in various sectors.
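
Cross-border hiring can be identified by comparing the employee's country of residence with the company's location. The sketch below shows that filter on invented rows; it assumes the columns are named `employee_residence` and `company_location` as in the variable list, and the helper name is hypothetical.

```python
import pandas as pd


def cross_border_hires(df: pd.DataFrame) -> pd.Series:
    """Count employees per residence country whose employer is located elsewhere."""
    abroad = df[df["employee_residence"] != df["company_location"]]
    return abroad["employee_residence"].value_counts()


# Toy rows for illustration only.
demo = pd.DataFrame(
    {
        "employee_residence": ["IN", "IN", "IN", "DE", "US"],
        "company_location": ["US", "US", "IN", "US", "US"],
    }
)
counts = cross_border_hires(demo)
print(counts)
```

Raw counts favor countries with many records overall, so dividing by each country's total number of employees would give a fairer comparison.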

6. For remote jobs in the data field, how much does experience level matter?

Inference 6: The analysis underscores that remote jobs offering high salaries are primarily secured by experts, with a moderate representation for intermediate professionals. However, the opportunities for junior-level positions in securing such lucrative remote roles appear comparatively limited. This insight highlights a correlation between experience levels and the likelihood of obtaining well-compensated remote positions.

7. What is the current trend of remote working?

Inference 7: The graph indicates that the peak of remote work occurred in 2021 during the COVID-19 pandemic, but there is now a discernible slowdown in the prevalence of remote jobs. This observation suggests a potential shift in the remote work landscape, reflecting changing workplace dynamics post-pandemic.
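
A trend like this can be quantified as the share of fully remote records per year. The sketch below assumes that `remote_ratio` is stored as a percentage and that a value of 100 means fully remote; both are assumptions about the encoding, and the rows are invented for illustration.

```python
import pandas as pd


def fully_remote_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of records per work_year with remote_ratio equal to 100."""
    is_remote = df["remote_ratio"] == 100
    return is_remote.groupby(df["work_year"]).mean()


# Toy rows for illustration only.
demo = pd.DataFrame(
    {
        "work_year": [2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
        "remote_ratio": [100, 0, 100, 100, 50, 0, 0, 100, 0],
    }
)
share = fully_remote_share(demo)
print(share)
```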

8. What is the salary trend for Data Scientists over the years?

Inference 8: The scatter plot reveals a consistent upward trajectory in salaries for data scientists from 2020 to 2023, indicating a steady and positive trend in compensation over the years. This observation suggests a favorable market for data science professionals, with increasing recognition and value assigned to their skill set.
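
Alongside the scatter plot, a per-year summary statistic makes the trend easier to verify. The sketch below computes the median USD salary per year for one job title; the column names `job_title`, `work_year`, and `salary_in_usd` are assumed from the variable list, and the figures are invented.

```python
import pandas as pd


def yearly_median_salary(df: pd.DataFrame, title: str = "Data Scientist") -> pd.Series:
    """Median salary in USD per work_year for a single job title."""
    subset = df[df["job_title"] == title]
    return subset.groupby("work_year")["salary_in_usd"].median()


# Toy rows for illustration only.
demo = pd.DataFrame(
    {
        "job_title": ["Data Scientist"] * 3 + ["Data Engineer"] + ["Data Scientist"] * 2,
        "work_year": [2020, 2020, 2021, 2021, 2022, 2022],
        "salary_in_usd": [100000, 120000, 130000, 90000, 140000, 160000],
    }
)
medians = yearly_median_salary(demo)
print(medians)
```

The median is used here because a few very high salaries can pull the mean upward and exaggerate a year-over-year change.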

Conclusion

  1. Job Demand Disparity:

    • United States vs. Global: The analysis highlights a notable discrepancy in job demand, with the United States having a higher concentration of data scientist positions, while globally, data engineering roles are in greater demand.
  2. Salary Discrepancy:

    • Data Scientist Compensation: Despite the global demand for data engineering roles, the salaries for data scientists, both globally and within the United States, surpass those of data engineers, indicating a premium on data science skills.
  3. Remote Job Expertise:

    • Expertise in Remote Jobs: Examination of available data shows that well-paid remote job opportunities predominantly favor experts, with moderate representation of intermediate professionals and limited prospects for junior-level professionals.
  4. Global Technology Trends:

    • Technology Proficiency: While global data on technology skills is lacking, the US perspective implies that Python and Excel are pivotal technologies, substantiating their significance for excelling in the broader field of data.
  5. Global Salary Disparities:

    • Top Salaries Globally: The analysis identifies Israel as having the highest salaries globally, with the United States securing the third position. This sheds light on the variance in compensation structures on a global scale.
  6. International Workforce Dynamics:

    • Indian Workforce Abroad: The data reveals that Indians significantly dominate the workforce for companies based abroad, showcasing a strong international presence and the global demand for Indian talent.
  7. Post-Pandemic Remote Work Landscape:

    • Remote Job Trends Post-COVID: While there is an observed slowdown in remote job opportunities from 2020 to 2023, salaries for data scientists continue to rise. This suggests a promising future for the data science profession, demonstrating resilience and adaptability after the COVID-19 pandemic.

Discussion

Future Work:

  1. Enhancement of Salary Prediction Model:

    • The salary prediction model based on job descriptions can be refined and enhanced by exploring diverse machine learning algorithms and leveraging advanced natural language processing techniques for improved accuracy and precision.
  2. Global Salary Prediction Model:

    • Consideration should be given to establishing a comprehensive salary prediction model on a global scale. Acquiring relevant and diverse datasets would enable the calibration of salary predictions for various regions, including countries with significant economic impact like India and Germany.
  3. Cross-Industry Exploration:

    • Extend the scope of analysis beyond the data field to explore salary prediction models in other industries. This would offer insights into compensation trends, demand dynamics, and skill valuation across diverse professional domains.
  4. International Salary Comparisons:

    • Conduct a thorough examination of salary structures in different countries, such as India and Germany, to provide a comprehensive comparative analysis. This would contribute valuable insights into the global economic landscape and regional variations.
  5. Diversification Beyond Data:

    • Expand the focus beyond the data field and explore salary prediction models for diverse professional sectors. This diversification can uncover trends and patterns unique to each industry, offering a broader perspective on compensation dynamics.
  6. Exclusive Analysis of Remote Work Data:

    • Conduct a specialized analysis focusing solely on data related to remote work. This would involve delving into the specifics of remote job trends, salary variations, and the impact of remote work on compensation structures, providing valuable insights for remote work enthusiasts and employers alike.

Motivational Quote

*Your salary is the bribe they give you to forget your dreams* - "Embrace the journey of chasing your dreams, for in the pursuit of passion, success follows. Your salary may be a temporary reward, but the real treasure lies in the fulfillment of your aspirations. Don't merely run towards a paycheck; sprint towards your dreams, and watch as the currency of passion and determination transforms into the wealth of a purposeful and rewarding life."

Thank You

Arpita Halder
Masters in Artificial Intelligence
Matriculation No. 22974970